
Improve Tablet Refresh Behavior in VReplication Traffic Switch Handling #10058

Merged
merged 14 commits into from
Apr 20, 2022

Conversation

mattlord
Contributor

@mattlord mattlord commented Apr 7, 2022

Description

During operations such as SwitchWrites, we refresh the tablets in the source and target shards involved in a VReplication workflow in order to update the serving vschema and query execution rules. Let's also do that for the dry run so that we can warn users that the operation could fail. Example output:

$ time vtctlclient --server=127.0.0.1:15999 MoveTables -- --dry_run --tablet_types=rdonly,replica SwitchTraffic customer.commerce2customer

Following vreplication streams are running for workflow customer.commerce2customer:

id=1 on 0/zone1-0000000200: Status: Running. VStream Lag: 0s.

MoveTables Error: rpc error: code = Unknown desc = cannot switch traffic for workflow commerce2customer at this time: could not refresh all of the tablets involved in the operation:
failed to successfully refresh all tablets in the customer/0 target shard (<nil>):
  failed to refresh tablet zone1-0000000202: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = "transport: Error while dialing dial tcp 172.31.41.160:16202: connect: connection refused"


real	0m30.081s
user	0m0.013s
sys	0m0.005s

The topotools.RefreshTabletsByShard call is best-effort and lets the caller know if the results were incomplete (partial). We use a lower timeout within the traffic switcher when refreshing the state for each tablet (done in a goroutine) because:

  1. The default etcd lock/lease TTL is 60 seconds and if any of the tablets are unresponsive then the topo lock is likely to be lost by the time we go to release it
  2. The caller will know if the results were partial and can take appropriate action

We also add tablet refresh to the traffic switcher's pre-check call to ensure that we have healthy tablets on the source and target shards before we begin the operation; otherwise the operation is not safe and can fail. Example output:

$ time vtctlclient --server=127.0.0.1:15999 MoveTables -- --tablet_types=primary,rdonly,replica SwitchTraffic customer.commerce2customer 2>/dev/null

Following vreplication streams are running for workflow customer.commerce2customer:

id=1 on 0/zone1-0000000200: Status: Running. VStream Lag: 0s.

MoveTables Error: rpc error: code = Unknown desc = cannot switch traffic for workflow commerce2customer at this time: could not refresh all of the tablets involved in the operation:
failed to successfully refresh all tablets in the commerce/0 source shard:
  failed to refresh tablet zone1-0000000102: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = "transport: Error while dialing dial tcp 172.31.41.160:16102: connect: connection refused"


real	0m30.064s
user	0m0.008s
sys	0m0.009s

All of these changes help make VReplication workflows more predictable and more reliable, and reduce downtime.

Related Issue(s)

Checklist

@mattlord mattlord force-pushed the improve_vrepl_dry_run branch 9 times, most recently from c303c69 to 7eaec39 Compare April 11, 2022 15:41
@mattlord mattlord force-pushed the improve_vrepl_dry_run branch from 7eaec39 to 2f1f9ff Compare April 11, 2022 18:45
@mattlord mattlord changed the title Lower shard tablet refresh timeout and check this in vrepl dry runs Improve Tablet Refresh Behavior in VReplication Traffic Switch Handling Apr 11, 2022
@mattlord mattlord force-pushed the improve_vrepl_dry_run branch from 6a1489b to eb75634 Compare April 11, 2022 19:36
@mattlord mattlord force-pushed the improve_vrepl_dry_run branch 3 times, most recently from e7d1b54 to 67b1639 Compare April 15, 2022 16:51
@mattlord mattlord force-pushed the improve_vrepl_dry_run branch from 1e9f82f to 11fb7ca Compare April 19, 2022 04:54
This keeps the previous behavior of ignoring partial-refresh-related
errors, while providing the details of WHY we had a partial refresh
for anyone who's interested.

Signed-off-by: Matt Lord <[email protected]>
@mattlord mattlord force-pushed the improve_vrepl_dry_run branch from 11fb7ca to 18f9573 Compare April 19, 2022 05:34
@mattlord mattlord marked this pull request as ready for review April 19, 2022 05:40
@mattlord
Contributor Author

I ended up changing the function signature for topotools.RefreshTabletsByShard() so we can easily see where it's being called in the PR changes. The only thing left is going through each of those call sites to see if we should START checking for partial results (they were ignored before). @rohit-nayak-ps or @deepthi, happy to walk through that if either of you have opinions. The window where we could check and cancel is between:

  • The pre-check:
    reason, err := vrw.canSwitch(keyspace, workflowName)
    if err != nil {
        return nil, err
    }
    if reason != "" {
        return nil, fmt.Errorf("cannot switch traffic for workflow %s at this time: %s", workflowName, reason)
    }
  • The point of no return, where we create the journal records:
    • For SwitchReads: ... not clear we have one
    • For SwitchWrites:
      // This is the point of no return. Once a journal is created,
      // traffic can be redirected to target shards.
      if err := sw.createJournals(ctx, sourceWorkflows); err != nil {
          ts.Logger().Errorf("createJournals failed: %v", err)
          return 0, nil, err
      }

@rohit-nayak-ps rohit-nayak-ps (Contributor) left a comment:

lgtm

@@ -121,7 +137,7 @@ func UpdateShardRecords(
 	// For 'to' shards, refresh to make them serve. The 'from' shards will
 	// be refreshed after traffic has migrated.
 	if !isFrom {
-		if _, err := RefreshTabletsByShard(ctx, ts, tmc, si, cells, logger); err != nil {
+		if _, _, err := RefreshTabletsByShard(ctx, ts, tmc, si, cells, logger); err != nil {
mattlord (Contributor Author):

This is called from a number of places via Wrangler.updateShardRecords(). I don't think that we can change the partial handling here w/o other changes.

This way the caller has that info and can decide what to do with
it (if anything).

Signed-off-by: Matt Lord <[email protected]>
@@ -354,7 +354,7 @@ func (wr *Wrangler) cancelHorizontalResharding(ctx context.Context, keyspace, sh

 	destinationShards[i] = updatedShard

-	if _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil {
+	if _, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil {
mattlord (Contributor Author):

Doesn't make sense to return an error when canceling.

@@ -442,7 +442,7 @@ func (wr *Wrangler) MigrateServedTypes(ctx context.Context, keyspace, shard stri
 		refreshShards = destinationShards
 	}
 	for _, si := range refreshShards {
-		_, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, cells, wr.Logger())
+		_, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, cells, wr.Logger())
mattlord (Contributor Author):

I believe this is past the point of no return.

@@ -792,7 +792,7 @@ func (wr *Wrangler) masterMigrateServedType(ctx context.Context, keyspace string
 	}

 	for _, si := range destinationShards {
-		if _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil {
+		if _, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, si, nil, wr.Logger()); err != nil {
mattlord (Contributor Author):

I believe this is past the point of no return.

@@ -1226,7 +1226,7 @@ func (wr *Wrangler) replicaMigrateServedFrom(ctx context.Context, ki *topo.Keysp

 	// Now refresh the source servers so they reload the denylist
 	event.DispatchUpdate(ev, "refreshing sources tablets state so they update their denied tables")
-	_, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, sourceShard, cells, wr.Logger())
+	_, _, err := topotools.RefreshTabletsByShard(ctx, wr.ts, wr.tmc, sourceShard, cells, wr.Logger())
mattlord (Contributor Author):

I believe this is past the point of no return.

-	_, err := topotools.RefreshTabletsByShard(ctx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger())
+	rtbsCtx, cancel := context.WithTimeout(ctx, shardTabletRefreshTimeout)
+	defer cancel()
+	_, _, err := topotools.RefreshTabletsByShard(rtbsCtx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger())
mattlord (Contributor Author):

This is called when canceling.

-	_, err := topotools.RefreshTabletsByShard(ctx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger())
+	rtbsCtx, cancel := context.WithTimeout(ctx, shardTabletRefreshTimeout)
+	defer cancel()
+	_, _, err := topotools.RefreshTabletsByShard(rtbsCtx, ts.TopoServer(), ts.TabletManagerClient(), source.GetShard(), nil, ts.Logger())
@mattlord mattlord (Contributor Author) commented Apr 19, 2022:

This is a safe/good place to return an error if we could not refresh all tablets, as it's early on in the operation. I'll work on this one.

Member:

sgtm

mattlord (Contributor Author):

Done: 90f656f

-	_, err := topotools.RefreshTabletsByShard(ctx, ts.TopoServer(), ts.TabletManagerClient(), target.GetShard(), nil, ts.Logger())
+	rtbsCtx, cancel := context.WithTimeout(ctx, shardTabletRefreshTimeout)
+	defer cancel()
+	_, _, err := topotools.RefreshTabletsByShard(rtbsCtx, ts.TopoServer(), ts.TabletManagerClient(), target.GetShard(), nil, ts.Logger())
mattlord (Contributor Author):

This is after the point of no return.

@deepthi deepthi (Member) left a comment:

Nice work! lgtm

@mattlord mattlord merged commit b5ca7a6 into vitessio:main Apr 20, 2022
@mattlord mattlord deleted the improve_vrepl_dry_run branch April 20, 2022 19:22
Labels
Component: Cluster management Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)